Starbucks Capstone Challenge - Data Segmentation

Here, I will attempt to showcase the different demographic groups of users that share similar behaviour on the Starbucks mobile app. I will be using the following methods:

  1. FMT segmentation (frequency, monetary value, tenure) - group users into quantiles for each of these metrics, score each user on all 3 quantile features, and create segments based on the total scores
    • Frequency - how often the user made a transaction
    • Monetary value - how much money the user spent
    • Tenure - how long the user has been using the app
    • This is normally RFM (recency, frequency, monetary value), but recency is not useful here with only 1 month of activity data, so tenure is replacing recency
  2. K-means clustering - group users into clusters based on demographic and FMT features

Part I: Dependencies and data

Part II: Quantile Segmentation - Frequency, Monetary, Tenure

This is based on RFM (recency, frequency, monetary) segmentation, but since there is only 1 month of data, recency won't be of any use. Instead of recency, I will be looking at tenure. For each customer, this section will explore:

1. User FMT - frequency, monetary value, and tenure

The FMT values described above are calculated for each user as follows:
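One way to compute these three values with pandas is sketched below. The table layout, the column names (`user_id`, `amount`, `became_member_on`), the `compute_fmt` helper and the snapshot date are all assumptions made for illustration, not the project's actual schema.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions.
transactions = pd.DataFrame({
    "user_id": ["a", "a", "b", "c", "c", "c"],
    "amount": [5.0, 7.5, 20.0, 3.0, 4.0, 2.5],
})
profile = pd.DataFrame({
    "user_id": ["a", "b", "c", "d"],
    "became_member_on": pd.to_datetime(
        ["2017-01-15", "2018-06-01", "2016-03-10", "2018-01-01"]
    ),
})
snapshot_date = pd.Timestamp("2018-08-01")  # assumed reference date for tenure


def compute_fmt(transactions, profile, snapshot_date):
    """Return one row per user with frequency, monetary and tenure columns."""
    # Frequency = number of transactions, monetary = total amount spent.
    spend = transactions.groupby("user_id")["amount"].agg(
        frequency="count", monetary="sum"
    )
    # Tenure = days between sign-up and the snapshot date.
    signup = profile.set_index("user_id")["became_member_on"]
    tenure = (snapshot_date - signup).dt.days.rename("tenure")
    fmt = pd.concat([spend, tenure], axis=1)
    # Users without any transaction get zero frequency and spend.
    fmt[["frequency", "monetary"]] = fmt[["frequency", "monetary"]].fillna(0)
    return fmt


fmt = compute_fmt(transactions, profile, snapshot_date)
print(fmt)
```

Measuring tenure in days keeps it on a continuous scale, which makes the later quantile split straightforward.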

2. FMT segmentation

The FMT values are reduced to quantiles in order to score each user. The number of quantiles chosen for each feature acts as that feature's weight in the segmentation.

Example: a user in the top quantile for all 3 features will get a total score of 17 (6 for frequency + 8 for monetary value + 3 for tenure).

The weights are chosen arbitrarily, but the general idea is that monetary value is the most important feature here since it is the main focus of this project. It is followed by frequency and then tenure, both of which would carry more weight if the focus were on how customers use the app, which is not the case here.

The individual scores for each feature are summed up for a maximum of 17 points, which creates 15 cohorts of users (3 - 17 points). Since there are too many cohorts and the cohort sizes aren't very balanced, I grouped the cohorts into 3 tiers of customers: red customers have a total score between 3 and 7, yellow between 8 and 12, and blue between 13 and 17.
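The scoring and tiering described above can be sketched with `pandas.qcut` and `pandas.cut`. The quantile counts (6 for frequency, 8 for monetary value, 3 for tenure) are inferred from the 6 + 8 + 3 example, and the column names, the `score_fmt` helper and the random data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Number of quantiles per feature (assumed from the 6 + 8 + 3 example).
N_QUANTILES = {"frequency": 6, "monetary": 8, "tenure": 3}


def score_fmt(fmt):
    """Add per-feature quantile scores, a total score and a tier label."""
    scored = fmt.copy()
    for col, q in N_QUANTILES.items():
        # Ranking first breaks ties, so qcut never sees duplicate bin edges.
        ranks = fmt[col].rank(method="first")
        labels = list(range(1, q + 1))
        scored[col + "_score"] = pd.qcut(ranks, q=q, labels=labels).astype(int)
    score_cols = [col + "_score" for col in N_QUANTILES]
    scored["total_score"] = scored[score_cols].sum(axis=1)
    # Right-closed bins: 3-7 is red, 8-12 is yellow, 13-17 is blue.
    scored["tier"] = pd.cut(
        scored["total_score"],
        bins=[2, 7, 12, 17],
        labels=["red", "yellow", "blue"],
    )
    return scored


# Hypothetical random users, only to exercise the function.
rng = np.random.default_rng(0)
n_users = 240
fmt = pd.DataFrame({
    "frequency": rng.poisson(8, size=n_users),
    "monetary": rng.gamma(2.0, 50.0, size=n_users),
    "tenure": rng.integers(30, 1800, size=n_users),
})
scored = score_fmt(fmt)
print(scored["tier"].value_counts())
```

Printing the tier counts is a quick way to check how balanced the three tiers are for a given choice of cut points.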

As seen in the snake plot above, the average of all 3 FMT metrics increases with an increasing user tier, although not at the same rate.
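The data behind a snake plot is the per-segment mean of each standardized metric. A minimal sketch, assuming a table with one row per user, a `tier` column and the three FMT columns (the names, the `snake_plot_data` helper and the toy values are hypothetical):

```python
import pandas as pd


def snake_plot_data(df, metrics, group_col):
    """Return the mean z-score of each metric per group, in long format."""
    # Standardize so that metrics on different scales become comparable.
    z = (df[metrics] - df[metrics].mean()) / df[metrics].std()
    z[group_col] = df[group_col]
    means = z.groupby(group_col)[metrics].mean()
    return means.reset_index().melt(
        id_vars=group_col, var_name="metric", value_name="mean_z"
    )


# Hypothetical toy users.
users = pd.DataFrame({
    "tier": ["red", "red", "yellow", "yellow", "blue", "blue"],
    "frequency": [1, 2, 4, 5, 8, 9],
    "monetary": [5.0, 10.0, 40.0, 50.0, 150.0, 200.0],
    "tenure": [100, 200, 300, 400, 500, 600],
})
snake = snake_plot_data(users, ["frequency", "monetary", "tenure"], "tier")
print(snake)
```

Plotting `mean_z` against `metric` with one line per tier gives the snake plot. Because the metrics are standardized, the lines show how far each tier sits from the overall average, not the raw values.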

3. Completion rate of customer tiers

4. Customer tier demographics

With an increasing customer tier, we can see the following changes:

As red users do not spend a lot of money, they're not very likely to respond to offers so it would be a good idea to either stop sending them offers or to only send them offers that are easy to complete.

Yellow users do spend quite a bit more than red users, so it's actually worth sending them offers. They completed a good portion of discount offers 3 and 4 that were sent out, so focusing on the easier offers or lowering the difficulty of the harder offers would likely result in an increase in the number of offers that are completed.

Blue users consistently have a high rate of offer completion so it would actually benefit Starbucks to increase the difficulty of offers being sent to these users. As they are highly likely to respond to offers, a higher difficulty would likely increase the amount these customers spend.

6. Principal Component Analysis (PCA)

Gender is reduced to a binary feature that indicates whether the user is "male". There is a very low number of "other" users, so they are grouped with "female" users to represent non-male users. This is still a categorical feature, and K-means clustering isn't meant for categorical features, so PCA is used instead to turn the features into continuous components.

I will try to characterize each component based on its loadings.

  1. Component 1 - high spending amount
  2. Component 2 - low income, high spending frequency
  3. Component 3 - female, young, newer to the app
  4. Component 4 - male, high spending amount, newer to the app
  5. Component 5 - older, lower income, newer to the app
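A loadings table is what makes this kind of characterization possible. The sketch below encodes gender as a binary feature, standardizes the features, fits a 5-component PCA with scikit-learn and collects the loadings. The feature names, the gender codes ("M", "F", "O") and the random data are assumptions for illustration and will not reproduce the components listed above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical random users; columns and gender codes are assumptions.
rng = np.random.default_rng(0)
n_users = 200
users = pd.DataFrame({
    "gender": rng.choice(["M", "F", "O"], size=n_users, p=[0.55, 0.43, 0.02]),
    "age": rng.integers(18, 90, size=n_users),
    "income": rng.normal(65000, 20000, size=n_users),
    "frequency": rng.poisson(8, size=n_users),
    "monetary": rng.gamma(2.0, 50.0, size=n_users),
    "tenure": rng.integers(30, 1800, size=n_users),
})

# Binary gender: 1 for male, 0 for female and other combined.
features = users.drop(columns="gender").assign(
    is_male=(users["gender"] == "M").astype(int)
)

# PCA is scale-sensitive, so standardize every feature first.
scaled = StandardScaler().fit_transform(features)

pca = PCA(n_components=5, random_state=0)
components = pca.fit_transform(scaled)

# Rows are the original features, columns are the principal components.
loadings = pd.DataFrame(
    pca.components_.T,
    index=features.columns,
    columns=[f"PC{i + 1}" for i in range(pca.n_components_)],
)
print(loadings.round(2))
print(pca.explained_variance_ratio_.round(3))
```

A large positive or negative loading means the feature strongly drives that component, which is how a label such as "high spending amount" is read off the table.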

7. K-means clustering

Using the elbow method, I initially decided to create 7 clusters, but with 7 clusters the cluster sizes were imbalanced. After some trial and error, I found 4 clusters to be optimal because the clusters are distinguishable and more interpretable.
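The elbow method and the cluster-size check can be sketched as follows with scikit-learn. The synthetic blobs stand in for the PCA components and are purely hypothetical, so the elbow found here says nothing about the real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical stand-in for the PCA-transformed user features.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

inertias = {}
sizes = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia is the within-cluster sum of squared distances.
    inertias[k] = km.inertia_
    # Cluster sizes help spot imbalanced solutions.
    sizes[k] = np.bincount(km.labels_, minlength=k)

for k in inertias:
    print(k, round(inertias[k], 1), sizes[k].tolist())
```

The elbow is the value of k after which inertia stops dropping sharply. Checking the sizes alongside it shows whether a candidate k splits off very small clusters.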

The main identifying characteristics for each cluster are:

The clusters were named using the snake plot. The general description of each cluster is as follows:

8. Cluster demographics

  1. Low spenders - male
  2. Low spenders - female
  3. Low earners
  4. High earners

9. Completion rate of clusters

In the barplot above, the customer clusters are ordered based on their average monetary value, i.e. low spenders - male spent the least and high earners spent the most money. We can see for most offers that as a group's average spending increases, so does the rate at which they complete offers.
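A completion rate per cluster and offer is a grouped mean over a completed flag. A minimal sketch, assuming one row per offer sent with hypothetical `segment`, `offer_id` and `completed` columns and made-up values:

```python
import pandas as pd

# Hypothetical toy data: one row per offer sent to a user.
offers = pd.DataFrame({
    "segment": ["low", "low", "low", "low", "high", "high", "high", "high"],
    "offer_id": [1, 1, 2, 2, 1, 1, 2, 2],
    "completed": [0, 1, 0, 0, 1, 1, 1, 0],
})

# The mean of a 0/1 flag is the share of sent offers that were completed.
rates = (
    offers.groupby(["segment", "offer_id"])["completed"]
    .mean()
    .unstack("offer_id")
)
print(rates)
```

With segments as rows and offers as columns, the table can be passed directly to a grouped bar plot.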

The completion rates of low spenders - male, low earners, and high earners look similar to those of the red, yellow, and blue tiers from the FMT segmentation, respectively, so the same suggestions would apply to these clusters.

Save data